SPACE 1

home *** CD-ROM | disk | FTP | other *** search

/ SPACE 1 / SPACE - Library 1 - Volume 1.iso / program / 441 / fplib20 / dnotes < prev next >

Wrap

Text File | 1990-11-23 | 5.0 KB | 134 lines

Design notes for FPLIB Release 2.0 ================================== NUMBER REPRESENTATION --------------------- The floating point format used is Motorola's "Fast Floating Point": ----------------------------------------- | 24 bits mantissa |S| 7 bits exp | ----------------------------------------- The mantissa has an implicit binary point just off to the left. The exponent is binary, stored excess-64. In other words, the actual mathematical exponent has 64 added, and the resulting 7-bit number is unsigned. Negative numbers are stored as sign-and-magnitude. All floating-point operations expect normalized numbers, and generate normalized numbers if the inputs are normalized. A number is normalized if the most significant bit is 1 (except for 0.0, which is stored as 32 bits of zero, never negative). Also, a nonzero mantissa with a biased exponent of 0 is not normalized, letting us test the low byte or whole long as a test for 0.0. This all means that the most significant bit is actually redundant, and could be left off, giving one extra bit of precision. Oh well. Some of these routines make use of the assumption that all arguments are normalized, and may generate slightly bogus results if that isn't true. Release 1.0 assumed that a biased exponent of zero could have a nonzero mantissa; that has been fixed. Some examples: 1.0 is hex 80000041 , or 0.5 * 2 ** 1 (binary 0.1, exponent 1 + bias of 64). -0.75 is hex C00000C0 (note the sign bit is inverted, but no other bit is). ACCURACY AND LIMITS ------------------- The accuracy is in the 24-bit mantissa, which gives between 7.22 and 7.52 decimal digits of relative error. Decimal constants are given to 8 digits, and the compiler is trusted to convert them right (it generally does a reasonable job). The range is in the exponent. The smallest nonzero and largest positive normalized numbers are ~5.4e-20 and ~9.2e18. Rounding of ambiguous numbers uses round-to-even (see D. E. Knuth's Seminumerical Algorithms for a discussion on this). This means that if the trailing portion is exactly one half of the least significant bit, the lsb is forced to be the nearer even number. We often keep several insignificant bits to be sure this is done correctly. CODING TRICKS ------------- It's quite normal in a library of this type to resort to bit-twiddling in the internal representation. To help this, FPLIB.H defines a C union, thus: union { unsigned long ul; float f; char sc[4]; }; The number can be looked at as a floating or integer long (for handling bits in the mantissa), the sign is the sign of sc[3], and the exponent is the rest of sc[3]. Other shortcuts are commented in the inividual routines. MATHEMATICAL INFORMATION ------------------------ Well, this is mostly what I remember from my undergraduate days. Many of these functions have a precise mathematical formula or algorithm that will give the result, but is too slow for practical use. Instead, we look for an approximation that is quick to compute, and gives us just enough precision and no more. After all, why compute to 100 digits accuracy when you can only represent less than 8? Some of these approximations look very strange, but they all have sound mathematical bases. The name Chebyshev springs to mind. Dedicated mathematicians have struggled for decades to bring you these formulae and, just as important, a predictable accuracy of better than 7.5 decimal digits. Because of this, we can execute a known number of operations instead of loop-and-test (as you would in, for example, a simple Newton's iteration for square root). The approximations are generally polynomials, usually avoid division, and are valid only over a small range of the input parameter. We use simple mathematical identities to extend the range. For example, the cyclic nature of the trigonometric functions means we only need calculate in the range 0..pi/2. For another example, we can do much of the square root calculation by the simple expedient of dividing the binary exponent by 2. RANDOM LIBRARY INFORMATION -------------------------- Release 1.0 of Sozobon C had a few problems in the floating support: - Two negative floating numbers compared wrong (returned > when it should be <). I corrected that. However, the other routines don't assume the correction has been installed in the library (which it wasn't, at first...). - The library was not optimally ordered, and had to be scanned twice. I straightened that out. - Add and subtract were truncating. Rounding is generally a better idea. atof() calls sscanf(). Ick. ------------------------------------------------------------------------------- These routines may be used without restriction, but I retain the copyright on them, and the Only True Upgrades will be those I make. If you have any improvements, please send them to me. March 1989 David W. Brooks 36 Elm Street Medfield, MA 02052 BROOKS@CSSS-A.PRIME.COM {uunet,mit-eddie}!csss-a.prime.com!brooks